plot_ly() and ggplotly().
plot_geo().
We will work with two Starbucks datasets: one with store locations (global) and one with nutritional data for their food and drink items. We will also do some text analysis of the menu items.
Upload an HTML file to Quercus and make sure the figures remain interactive.
sb_locs <- read_csv("https://raw.githubusercontent.com/JSC370/JSC370-2025/refs/heads/main/data/starbucks/starbucks-locations.csv")
## Rows: 25600 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Store Number, Store Name, Ownership Type, Street Address, City, St...
## dbl (2): Longitude, Latitude
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sb_nutr <- read_csv("https://raw.githubusercontent.com/JSC370/JSC370-2025/refs/heads/main/data/starbucks/starbucks-menu-nutrition.csv")
## Rows: 205 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Item, Category
## dbl (5): Calories, Fat (g), Carb. (g), Fiber (g), Protein (g)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
usa_pop <- read_csv("https://raw.githubusercontent.com/JSC370/JSC370-2025/refs/heads/main/data/starbucks/us_state_pop.csv")
## Rows: 55 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): state
## dbl (1): population
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
usa_states<-read_csv("https://raw.githubusercontent.com/JSC370/JSC370-2025/refs/heads/main/data/starbucks/states.csv")
## Rows: 51 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): State, Abbreviation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sb_locs)
## # A tibble: 6 × 12
## `Store Number` `Store Name` `Ownership Type` `Street Address` City
## <chr> <chr> <chr> <chr> <chr>
## 1 47370-257954 Meritxell, 96 Licensed Av. Meritxell, … Ando…
## 2 22331-212325 Ajman Drive Thru Licensed 1 Street 69, Al… Ajman
## 3 47089-256771 Dana Mall Licensed Sheikh Khalifa … Ajman
## 4 22126-218024 Twofour 54 Licensed Al Salam Street Abu …
## 5 17127-178586 Al Ain Tower Licensed Khaldiya Area, … Abu …
## 6 17688-182164 Dalma Mall, Ground Flo… Licensed Dalma Mall, Mus… Abu …
## # ℹ 7 more variables: `State/Province` <chr>, Country <chr>, Postcode <chr>,
## # `Phone Number` <chr>, Timezone <chr>, Longitude <dbl>, Latitude <dbl>
head(sb_nutr)
## # A tibble: 6 × 7
## Item Category Calories `Fat (g)` `Carb. (g)` `Fiber (g)` `Protein (g)`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Chonga Bagel Food 300 5 50 3 12
## 2 8-Grain Roll Food 380 6 70 7 10
## 3 Almond Croi… Food 410 22 45 3 10
## 4 Apple Fritt… Food 460 23 56 2 7
## 5 Banana Nut … Food 420 22 52 2 6
## 6 Blueberry M… Food 380 16 53 1 6
head(usa_pop)
## # A tibble: 6 × 2
## state population
## <chr> <dbl>
## 1 Alabama 4779736
## 2 Alaska 710231
## 3 Arizona 6392017
## 4 Arkansas 2915918
## 5 California 37253956
## 6 Colorado 5029196
head(usa_states)
## # A tibble: 6 × 2
## State Abbreviation
## <chr> <chr>
## 1 Alabama AL
## 2 Alaska AK
## 3 Arizona AZ
## 4 Arkansas AR
## 5 California CA
## 6 Colorado CO
sb_locs = Starbucks locations, including the street address, city, state/province, longitude, and latitude
sb_nutr = nutritional information for the items on the Starbucks menu, including calories, fat, carbs, fiber, and protein
usa_pop = population of each US state
usa_states = abbreviations of all the US states
sb_usa <- sb_locs |> filter(Country == "US")
# Count stores per state; the State/Province column holds state abbreviations
sb_locs_state <- sb_usa |>
  rename(state = `State/Province`) |>
  group_by(state) |>
  summarize(n_stores = n())
# need state abbreviations
usa_pop_abbr <-
full_join(usa_pop, usa_states,
by = join_by(state == State)
)
sb_locs_state <- full_join(sb_locs_state, usa_pop_abbr,
by = join_by(state == Abbreviation))
summary(sb_locs_state)
## state n_stores state.y population
## Length:55 Min. : 8.0 Length:55 Min. : 56882
## Class :character 1st Qu.: 56.5 Class :character 1st Qu.: 1344331
## Mode :character Median : 123.0 Mode :character Median : 3751351
## Mean : 266.8 Mean : 5677621
## 3rd Qu.: 332.0 3rd Qu.: 6515716
## Max. :2821.0 Max. :37253956
## NA's :4
For the numeric variables, the number of stores per state ranges from 8 to 2821, and the population ranges from 56,882 to 37,253,956. The summary also reports 4 missing values (NA's). The other variables are of type character.
ggplotly for EDA
Answer the following questions:
Is the number of Starbucks stores proportional to the population of a state? (scatterplot)
Is the caloric distribution of Starbucks menu items different for drinks and food? (histogram)
What are the top 20 words in Starbucks menu items? (bar plot)
p1 <- ggplot(sb_locs_state, aes(x = population, y = n_stores,
colour = state)) +
geom_point(alpha = 0.8) +
theme_bw()
ggplotly(p1)
The plot shows a general trend: as population grows, the number of Starbucks stores also increases. However, the relationship is not strictly linear, as some states (such as WA) have more stores than their population would suggest. Overall, the number of Starbucks stores is positively correlated with population but not perfectly proportional to it.
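One way to probe proportionality more directly is to compute a per-capita rate: if stores were strictly proportional to population, the rate would be the same in every state. The sketch below is only an illustration on a hypothetical toy data frame (`toy_states`, with made-up values and placeholder state codes) whose column names mirror `sb_locs_state`; it is not output from the lab data.

```r
# Hypothetical toy data; column names mirror sb_locs_state, values are made up
toy_states <- data.frame(
  state      = c("AA", "BB", "CC"),
  n_stores   = c(100, 300, 50),
  population = c(1e6, 2e6, 1e6),
  stringsAsFactors = FALSE
)

# Under strict proportionality this rate would be identical in every state
toy_states$stores_per_100k <- toy_states$n_stores / toy_states$population * 1e5

# Sort from highest to lowest rate to spot states with unusually many stores
toy_rates <- toy_states[order(-toy_states$stores_per_100k), ]
print(toy_rates)
```

Applying the same two steps to `sb_locs_state` (after dropping rows with missing values) would show which states sit furthest from the overall rate.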
p2 <- ggplot(sb_nutr, aes(x = Calories, fill = Category)) +
geom_histogram(alpha = 0.6) +
theme_bw()
ggplotly(p2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Yes, the caloric distribution differs between drinks and food. Most drinks have lower calorie counts, clustering between 0 and 300 calories, whereas food items are spread mainly between 100 and 600 calories. Drinks therefore tend to sit in a lower caloric range than food items.
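The `stat_bin()` message above suggests choosing a bin width explicitly. The sketch below shows one possible way to do that, using a hypothetical toy data frame (`toy_nutr`, made-up values) with the same column names as `sb_nutr`. The bin width of 50 is an arbitrary choice for illustration, and `position = "identity"` is an optional change that overlays the two groups instead of stacking them.

```r
library(ggplot2)

# Hypothetical toy data; column names mirror sb_nutr, values are made up
toy_nutr <- data.frame(
  Calories = c(0, 60, 120, 150, 300, 350, 420, 460),
  Category = c("Drinks", "Drinks", "Drinks", "Drinks",
               "Food", "Food", "Food", "Food")
)

# binwidth = 50 is an arbitrary illustrative choice;
# position = "identity" overlays the groups rather than stacking the bars
p_hist <- ggplot(toy_nutr, aes(x = Calories, fill = Category)) +
  geom_histogram(binwidth = 50, alpha = 0.6, position = "identity") +
  theme_bw()
p_hist
```

The resulting ggplot object can be passed to `ggplotly()` in the same way as `p2`.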
p3 <- sb_nutr |>
unnest_tokens(word, Item, token = "words") |>
count(word, sort = TRUE) |>
head(20) |>
ggplot(aes(fct_reorder(word, n), n)) +
geom_col() +
coord_flip() +
theme_bw()
ggplotly(p3)
The top word is iced, which appears 32 times in the item names of the
sb_nutr dataset. The words tazo, bottled,
sandwich, and chocolate are also among the top 20
words.
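The word counts above keep every token. If common English words such as "with" or "and" crowd the ranking, one option is to drop them using the `stop_words` table that ships with tidytext. The sketch below illustrates this on a hypothetical toy menu (`toy_menu`, with made-up item names); it is not part of the lab's own pipeline.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy menu; item names are made up for illustration
toy_menu <- data.frame(
  Item = c("Iced Coffee with Milk", "Iced Tea", "Coffee and Cake"),
  stringsAsFactors = FALSE
)

toy_counts <- toy_menu |>
  unnest_tokens(word, Item, token = "words") |>  # one lowercased word per row
  anti_join(stop_words, by = "word") |>          # drop common English stop words
  count(word, sort = TRUE)                       # most frequent words first
toy_counts
```

Note that `unnest_tokens()` lowercases tokens by default, which is why the top word is reported as iced rather than Iced.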
plot_ly()
Create a scatterplot using plot_ly() representing the relationship between calories and carbs. Color the points by category (food or beverage). Is there a relationship, and do food or beverages tend to have more calories?
sb_nutr |>
plot_ly(x = ~Calories, y = ~`Carb. (g)`,
type = "scatter", mode = "markers", color = ~Category)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Yes, there is a positive relationship between calories and carbs: in the scatterplot, items with more calories generally have more carbs. Food tends to have more calories than drinks.
hovermode = "compare".
topwords <- sb_nutr |>
unnest_tokens(word, Item, token = "words") |>
group_by(word) |>
summarise(word_frequency = n()) |>
arrange(desc(word_frequency)) |>
head(10)
sb_nutr |>
unnest_tokens(word, Item, token = "words") |>
filter(word %in% topwords$word) |>
plot_ly(
x = ~Calories,
y = ~`Carb. (g)`,
type = "scatter",
mode = "markers",
color = ~Category,
hoverinfo = "text",
text = ~paste0("Word: ", word)
) |>
layout(
title = "Cal vs Carbs",
xaxis = list(title = "Calories"),
yaxis = list(title = "Carbs"),
hovermode = "compare"
)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
plot_ly Boxplots
sb_nutr_long <- sb_nutr |>
unnest_tokens(word, Item, token = "words") |>
filter(word %in% topwords$word) |>
pivot_longer(cols = c(Calories, `Fat (g)`, `Carb. (g)`,
`Fiber (g)`, `Protein (g)`), names_to = "Nutrient", values_to = "value")
plot_ly(data = sb_nutr_long,
x = ~word,
y = ~value,
color = ~Nutrient,
type = "box" ) |>
layout(
title = "Nutrient values for the top 10 word items",
xaxis = list(title = "Item word"),
yaxis = list(title = "Nutritional Value")
)
Among the top words, sandwich is associated with the most calories: its median is the highest (460) and its whiskers extend the furthest upward, indicating that sandwich items generally have more calories than items containing the other words. The words with the highest protein are sandwich and egg, with similar medians (19 and 18.5, respectively). The box and whiskers for sandwich are wider, indicating more variation in protein than for egg.
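Medians read off a box plot can be cross-checked numerically by summarising the long-format data. The sketch below does this on a hypothetical toy data frame (`toy_words`, with made-up values) using the same `pivot_longer()` pattern as above. With the real data, the same `group_by()` and `summarize()` steps could be applied to `sb_nutr_long`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; values are made up for illustration
toy_words <- data.frame(
  word = c("sandwich", "sandwich", "egg", "egg"),
  Calories = c(400, 500, 200, 300),
  `Protein (g)` = c(18, 20, 17, 19),
  check.names = FALSE,      # keep the "Protein (g)" column name as written
  stringsAsFactors = FALSE
)

toy_medians <- toy_words |>
  pivot_longer(cols = c(Calories, `Protein (g)`),
               names_to = "Nutrient", values_to = "value") |>
  group_by(word, Nutrient) |>
  summarize(median_value = median(value), .groups = "drop")
toy_medians
```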
# Store the 3D plot under its own name so the long-format data is not overwritten
p_3d <- sb_nutr |>
  unnest_tokens(word, Item, token = "words") |>
  filter(word %in% topwords$word[1:10]) |>
  plot_ly(
    x = ~Calories,
    y = ~`Carb. (g)`,
    z = ~`Protein (g)`,
    color = ~word,
    type = "scatter3d",
    mode = "markers",
    marker = list(size = 5)) |>
  layout(
    title = "3D scatterplot of Calories, Carbs, and Protein",
    scene = list(
      xaxis = list(title = "Calories"),
      yaxis = list(title = "Carbohydrates (g)"),
      zaxis = list(title = "Protein (g)")
    )
  )
p_3d
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
The overall trend is weak, since the points are quite spread out. Still, items with more calories generally have more carbs and more protein, so there is a positive overall trend.
plot_ly Map
# Set up mapping details
set_map_details <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('steelblue')
)
# Make sure both maps are on the same color scale
shadeLimit <- 125
# Create hover text
sb_locs_state$hover <- with(sb_locs_state, paste("Number of Starbucks: ", n_stores, '<br>', "State: ", state.y, '<br>', "Population: ", population))
# Create the map
map1 <- plot_geo(sb_locs_state, locationmode = "USA-states") |>
add_trace(z = ~n_stores, text = ~hover, locations = ~state, color = ~n_stores, colors = "Purples") |>
layout(
title = "Starbucks stores by state", geo = set_map_details)
map1
## Warning: Ignoring 4 observations
# Second map: shade by population so it can be compared with the store counts
map2 <- plot_geo(sb_locs_state, locationmode = "USA-states") |>
  add_trace(z = ~population, text = ~hover, locations = ~state, color = ~population, colors = "Purples") |>
  layout(
    title = "Population by state", geo = set_map_details)
map2
subplot(map1, map2)
## Warning: Ignoring 4 observations
Side by side, the maps suggest that states with larger populations tend to have more Starbucks stores. For instance, California has the highest population and also the most stores, and Texas has the second highest population and the second largest number of stores. Overall, the two maps show a similar color pattern, so store counts largely track population.